With a few small exceptions, John wrote everything. Jared started the data import, John worked with Maria on building the models, and I (John) honestly spent about four days on this thing. Maria wrote some linear models that ended up not being used.
This program samples from toy distributions the first time it runs; this takes a little while (~1 min), and the results are cached so sampling never runs again. Running this as an IPython notebook should not be especially taxing: the notebook does not run the full MCMC, and I ensured this.
(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
Remind yourself about Homework 4.2, in which you computed the growth and division events of Caulobacter crescentus over time using data from this paper. In this problem, you will use your results from the image processing of those data sets to estimate the growth rates of the individual mother cells and to determine whether bacterial growth at the single-cell level is linear or exponential. You should use your own results from Homework 4 for this problem, but in the event that you had trouble getting those results, you can use these results from the HW 4 solutions.
We know that under ideal conditions, bacterial cells experience exponential growth in bulk. That is, the number of cells grows exponentially. This is possible regardless of how individual cells grow; the repeated divisions lead to exponential growth. In their paper, the authors argue that the growth of each individual cell is also exponential, i.e.,
\begin{align} a(t) = a_0 \mathrm{e}^{k t}, \end{align}
where $a(t)$ is the area of the cell in the image as a function of time and $a_0$ is the area of the cell right after a division has been completed, which we mark as $t = 0$.
As an alternative model, the authors consider a linear growth model, in which
\begin{align} a(t) = a_0 + b t. \end{align}
An exponential curve is approximately linear (with $b = a_0 k$) on short time scales, so it is often difficult to distinguish between linear and exponential growth. Your goal is to perform parameter estimates and do an effective comparison between these two models of growth. You should use hierarchical models, and be sure to take a principled approach to model construction and evaluation.
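To see why the two models are hard to distinguish, here is a quick numerical sketch. The scales used ($a_0 \approx 600$ µm$^2$, $k \approx 0.01$ min$^{-1}$) are illustrative order-of-magnitude guesses for a single growth phase, not fitted values:

```python
import numpy as np

# Illustrative scales only: a0 ~ 600 sq µm, k ~ 0.01 / min
# (order-of-magnitude guesses, not fitted values).
a0, k = 600.0, 0.01
t = np.linspace(0.0, 40.0, 200)   # one growth phase lasts tens of minutes

exp_growth = a0 * np.exp(k * t)
lin_growth = a0 + a0 * k * t      # the linear model with b = a0 * k

# Over a single division the two curves differ by only a few percent,
# which is why pooling many growth events matters for model comparison.
max_rel_diff = np.max(np.abs(exp_growth - lin_growth) / exp_growth)
```

With these scales the maximum relative difference over one growth phase is well under ten percent, comparable to the measurement noise in the segmented areas.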
Since you are using a hierarchical model, here are a few tips for building and implementing the models. You do not need to take this advice if you do not want to, but I have found that these strategies help.
import numpy as np
import pandas as pd
import pystan
import sys
# Import Altair for high level plotting
import altair as alt
import altair_catplot as altcat
# Prevent bulky Altair plots
alt.data_transformers.enable('json')
import bebi103
import bokeh
bokeh.io.output_notebook()
Let's load up the datasets from 4.2 and check their structure.
# Data for bacteria 1
df_bac1 = pd.read_csv("./bac_1_areas.csv")
df_bac1 = df_bac1.drop(["Unnamed: 0"], axis = 1)
# Data for bacteria 2
df_bac2 = pd.read_csv("./bac_2_areas.csv")
df_bac2 = df_bac2.drop(["Unnamed: 0"], axis = 1)
df_bac1.head()
We graph the area over time to get a sense of it.
alt.Chart(df_bac1).mark_point().encode(
x=alt.X('label', title = "Time (min)"),
y=alt.Y('area (sq µm)', scale=alt.Scale(zero=False)),
).interactive()
And now for bacteria 2:
df_bac2.head()
And graph:
alt.Chart(df_bac2).mark_point().encode(
x=alt.X('label', title = "Time (min)"),
y=alt.Y('area (sq µm)', scale=alt.Scale(zero=False)),
).interactive().properties(height=300,
width=900)
Remove unphysical outliers:
df_bac1["isOutlier"] = ((df_bac1['area (sq µm)'] < 500) | # Drop low-area outliers
(df_bac1['area (sq µm)'] > 1500)) # Drop high-area outliers
df_bac2["isOutlier"] = ((df_bac2['area (sq µm)'] < 500) | # Drop low-area outliers
(df_bac2['area (sq µm)'] > 1500)) # Drop high-area outliers
bac1_thresh = df_bac1[df_bac1["isOutlier"] == False]
bac2_thresh = df_bac2[df_bac2["isOutlier"] == False]
bac1_thresh = bac1_thresh.drop(columns = ["isOutlier"])
bac2_thresh = bac2_thresh.drop(columns = ["isOutlier"])
Importing functions from 4.2
def find_divisions(df):
    """Return a boolean array marking whether each point in a dataframe is a division."""
    # Start by identifying the first division.
    # The plots show that it occurs within the first 30 frames.
    areas = df['area (sq µm)'].values
    isDivision = np.zeros(len(df.index), dtype=bool)
    div_1 = np.argmin(areas[:30])
    isDivision[div_1] = True
    for i in range(30, len(df.index) - 30):
        # Sample the 30 points on either side of the point of interest.
        region = areas[i - 30:i + 31]
        if (np.argmin(region) == 30) and (region[30] < 780):
            isDivision[i] = True
    return isDivision
Create "isDivision" columns:
bac1_thresh.loc[:,"isDivision"] = find_divisions(bac1_thresh.copy())
bac2_thresh.loc[:,"isDivision"] = find_divisions(bac2_thresh.copy())
plot1 = alt.Chart(bac1_thresh).mark_point().encode(
x=alt.X('label', title = "Time (min)"),
y=alt.Y('area (sq µm)', scale=alt.Scale(zero=False)),
color = 'isDivision:N',
).interactive().properties(title = "Bacteria 1")
plot2 = alt.Chart(bac2_thresh).mark_point().encode(
x=alt.X('label', title = "Time (min)"),
y=alt.Y('area (sq µm)', scale=alt.Scale(zero=False)),
color = 'isDivision:N',
).interactive().properties(height=300,
width=900,
title = "Bacteria 2")
plot1 & plot2
I am curious how clean these data really are. I will try color-coding the plot by division number.
def get_div_num(df):
    """Return the cumulative division number for each point."""
    is_div = df["isDivision"].values
    div_num = np.zeros(len(is_div))
    for i in range(0, len(is_div)):
        div_num[i] = np.sum(is_div[:i + 1])
    return div_num
bac1_thresh.loc[:,"div_num"] = get_div_num(bac1_thresh.copy())
bac2_thresh.loc[:,"div_num"] = get_div_num(bac2_thresh.copy())
plot1 = alt.Chart(bac1_thresh).mark_point().encode(
x=alt.X('label', title = "Time (min)"),
y=alt.Y('area (sq µm)', scale=alt.Scale(zero=False)),
color = alt.Color('div_num:N',legend=None),
).interactive().properties(title = "Bacteria 1")
plot2 = alt.Chart(bac2_thresh).mark_point().encode(
x=alt.X('label', title = "Time (min)"),
y=alt.Y('area (sq µm)', scale=alt.Scale(zero=False)),
color = alt.Color('div_num:N',legend=None),
).interactive().properties(height=300,
width=900,
title = "Bacteria 2")
plot1 & plot2
The first bacterium's lineages look beautiful, but its first lineage is only two points, so I'll delete it. The first two lineages of the second bacterium look like a single lineage; I'll trim a few points and merge them.
# Remove lineage one for bac 1
bac1_thresh.loc[:,"div_num"] = bac1_thresh.copy()["div_num"].values - 1
bac1_thresh = bac1_thresh[bac1_thresh["div_num"] >= 0]
# Merge lineage and trim for bac 2
bac2_thresh = bac2_thresh[1:]
div_num = bac2_thresh["div_num"].values
for i in range(0, len(div_num)):
    if div_num[i] == 0:
        div_num[i] = 1
div_num = div_num - 1
bac2_thresh.loc[:,"div_num"] = div_num
The second bacterium's growth also looks good, with some errors near the cell divisions. We know we got the start correct, so this noise probably arises because the image processing has trouble deciding whether the cell has actually divided in the region around a division. These errors all occur above 950 µm$^2$, so I'll assign any point near (but past) a division that is above that threshold to the previous lineage. There are similar errors near the starts of divisions: if a point is among the last ten points of a lineage and is below 700 µm$^2$, it should belong to the next lineage.
is_div = bac2_thresh["isDivision"].values
div_num = bac2_thresh["div_num"].values
areas = bac2_thresh['area (sq µm)'].values
for i in range(0, len(bac2_thresh.index)):
    if is_div[i]:
        for j in range(0, 30):
            if areas[j + i] > 950:
                div_num[j + i] = div_num[j + i] - 1
        if div_num[i] > 1:
            for j in range(-10, 0):
                if areas[j + i] < 700:
                    div_num[j + i] = div_num[j + i] + 1
bac2_thresh.loc[:,"div_num"] = div_num
We noticed a few glaring errors and removed them.
# Manually delete errors in image segmentation
bac1_thresh = bac1_thresh.drop([139, 1632])
Let's re-make the plot and see if we solved the problems!
plot1 = alt.Chart(bac1_thresh).mark_point().encode(
x=alt.X('label', title = "Time (min)"),
y=alt.Y('area (sq µm)', scale=alt.Scale(zero=False)),
color = alt.Color('div_num:N',legend=None),
tooltip = ["label", "isDivision", 'area (sq µm)']
).interactive().properties(title = "Bacteria 1")
plot2 = alt.Chart(bac2_thresh).mark_point().encode(
x=alt.X('label', title = "Time (min)"),
y=alt.Y('area (sq µm)', scale=alt.Scale(zero=False)),
color = alt.Color('div_num:N',legend=None),
tooltip = ["label", "isDivision", 'area (sq µm)', 'div_num']
).interactive().properties(height=300,
width=900,
title = "Bacteria 2")
plot1 & plot2
Yay! Now the data look ready for modeling! The format Stan wants is a little tricky. Stan only accepts arrays of fixed length, so we will need separate arrays that specify, for each data point: the time, the cell area, the division (growth event) number, and the bacterium ID. Each of these arrays must be the length of our entire data set. I will set about making these from the above data.
bac1_thresh.loc[:,"bac_id"] = np.zeros(len(bac1_thresh.index))
bac2_thresh.loc[:,"bac_id"] = np.ones(len(bac2_thresh.index))
# Concatenate this into a single dataframe
data = pd.concat([bac1_thresh, bac2_thresh])
data = data.rename(columns = {"label":"time (min)"})
data = data.drop(columns = ["isDivision"])
The time column is currently a mess. We need to isolate each series and then subtract its start time from every time point in the series. I will write a function to find the minimum time of a series, and then call it for every point in the dataframe. This is slow, but it's easy to reason about and feasible since we don't have hundreds of thousands of data points.
def find_min_time(data, index):
    """Find the minimum time in dataframe `data` for the series
    associated with the entry at index `index`."""
    bac_num = data["bac_id"].values[index]
    div_num = data["div_num"].values[index]
    series = data[data["bac_id"] == bac_num]
    series = series[series["div_num"] == div_num]
    series_times = series["time (min)"].values
    return np.min(series_times)

times = data["time (min)"].values
new_times = np.zeros(len(data.index))
for i in range(0, len(new_times)):
    new_times[i] = times[i] - find_min_time(data.copy(), i)
data["time (min)"] = new_times
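For reference, the same zeroing can be done in one vectorized pass with a groupby-transform; this sketch assumes the column names used above:

```python
import pandas as pd

def zero_time_per_series(df):
    """Shift each (bac_id, div_num) series so that its first time point is zero."""
    out = df.copy()
    out["time (min)"] -= out.groupby(["bac_id", "div_num"])["time (min)"].transform("min")
    return out
```

`transform("min")` broadcasts each group's minimum back to the group's rows, so the subtraction aligns row-by-row without an explicit loop.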
Let's sanity-check the non-trivial operation that I just did by making a few plots.
plots = [0]*4
# Series to plot were chosen arbitrarily
for i in range(0, 4):
    data_plot = data[data["bac_id"] == (i % 2)]
    data_plot = data_plot[data_plot["div_num"] == i * 2 + 1]
    time = data_plot["time (min)"].values
    area = data_plot['area (sq µm)'].values
    p = bokeh.plotting.Figure(height=300, width=200,
                              title="Bacteria %i Division %i" % (i % 2, i * 2 + 1),
                              x_axis_label="time (min)",
                              y_axis_label="cell area (sq µm)")
    p.circle(time, area)
    plots[i] = p
bokeh.io.show(bokeh.layouts.row(plots))
data.head()
Yay!! All our data are now plotted on a time axis that makes intuitive sense! The last wrangling I want to do is to make the division number more informative: it should be unique across the entire dataframe and start at one, so it can be used for indexing in Stan.
# I'll make bac_id 1 and 2 for stan indexing also
data.loc[:,"bac_id"] = data["bac_id"].values + 1
# make div_num unique
bac_id = data["bac_id"].values
max_div_num = np.max(data[data["bac_id"]==1]["div_num"].values)
div_num = data["div_num"].values
for i in range(0, len(div_num)):
    if bac_id[i] == 2:
        div_num[i] = div_num[i] + max_div_num + 1
div_num = div_num + 1
data.loc[:,"div_num"] = div_num
data.head()
This dataframe is valuable, and I want to be able to access it from external Python files, which can run MCMC faster than a Jupyter kernel. I'll save it to a CSV, then reload it for consistency.
data.to_csv("./9.1_data_wrangled.csv")
Our dataframe is now in a format whose columns Stan will be able to interpret. We may now focus on the Stan model!
Our hierarchical model will have 3 layers:
I will first focus on the exponential model, which states: $$\text{cell mass} \propto \text{cell area} = a_0 \mathrm{e}^{kt}$$
The image processing gave us errors that can be erratic, so error at the lower levels of the model would probably not be best modeled as Gaussian. The Student-t distribution is a better choice here, and we will parametrize it differently for each cell. Using this model and plugging in the exponential as the location of the Student-t, our model predicts: $$\text{cell area} \sim \text{StudentT}(\nu_{2,i},\, (a_0)_{2, n, i}\mathrm{e}^{k_{2, n, i}t},\, \sigma_{2,i})$$
where subscripts are of the form $\nu_{\text{model layer}, \text{ cell number}}$ and $(a_0)_{\text{model layer}, \text{ growth event}, \text{ cell number}}$ ($n$ is the growth event, $i$ is the cell number). We can immediately come up with priors on $\nu$ and $\sigma$. The heaviness of the tails can vary a lot, so I'll choose a broad prior on $\nu$, $$ \nu_{2, i} \sim 1 + 10 * \text{lognormal}(0.1, 1.0)$$ The measurement error might vary by about 20 µm$^2$, so I'll make sigma half-normal with a standard deviation of 20. $$ \sigma_{2, i} \sim 20 * \text{halfnormal}(0, 1)$$
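To sanity-check the scales these priors imply, we can draw from them directly in numpy (a sketch of the stated priors, not the Stan code):

```python
import numpy as np

rng = np.random.RandomState(3252)

# nu ~ 1 + 10 * LogNormal(0.1, 1): heavy-tail parameter, always > 1.
nu = 1.0 + 10.0 * rng.lognormal(mean=0.1, sigma=1.0, size=10000)

# sigma ~ half-normal with scale 20 sq µm: measurement error scale.
sigma = 20.0 * np.abs(rng.normal(size=10000))

# The median nu is about 1 + 10 * exp(0.1) ~ 12, so a typical draw allows
# moderately heavy tails while still permitting nearly Gaussian behavior.
```

The broad spread in $\nu$ is deliberate: the tails can range from Cauchy-like to effectively Gaussian.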
The rest of the hierarchy will be less complicated. The parameters for the individual growth events will be drawn from the distributions above in the following way:
$a_0$ from the bottom layer is found from the $a_0$ above it, with some deviation:
\begin{align} (a_0)_{2, n, i} \sim & \, \text{norm}((a_0)_{1, i}, \sigma_{a, 1, i})\\ \sigma_{a, 1, i} \sim & \, 50 * \text{halfnorm}(0, 1) \end{align}
$k$ is similar.
\begin{align} k_{2, n, i} \sim & \, k_{1, i} + \sigma_{k, 1, i} * \text{norm}(0, 1)\\ \sigma_{k, 1, i} \sim & \, 0.0005 * \text{halfnorm}(0, 1) \end{align}
We repeat this process to derive the mid-layer hyperparameters.
\begin{align} (a_0)_{1, i} \sim & \,(a_0)_{0} + \sigma_{a, 0} * \text{norm}(0, 1)\\ \sigma_{a, 0} \sim & \, 50 * \text{halfnorm}(0, 1)\\ k_{1, i} \sim & \, k_{0} + \sigma_{k, 0} * \text{norm}(0, 1)\\ \sigma_{k, 0} \sim & \, 0.002 * \text{halfnorm}(0, 1) \end{align}
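The additive forms above (parameter = location + scale · norm(0, 1)) are the non-centered parameterization, which tends to sample better in hierarchies because the sampler explores a standard-normal offset rather than a distribution whose width is itself a parameter. A minimal numpy illustration of the equivalence, using illustrative scales matching the priors above:

```python
import numpy as np

rng = np.random.RandomState(42)
k0, sigma_k0 = 0.011, 0.001   # illustrative location and scale
n = 100000

# Centered: draw k1 directly from Normal(k0, sigma_k0).
k1_centered = rng.normal(loc=k0, scale=sigma_k0, size=n)

# Non-centered: draw a standard-normal offset z and transform.
# Stan samples z, which decouples its geometry from sigma_k0.
z = rng.normal(size=n)
k1_noncentered = k0 + sigma_k0 * z
```

Both parameterizations target the same distribution; only the sampling geometry differs.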
Finally, we use simple positive priors for the hyperparameters. Note that log-normal was chosen for $k$ because it must be positive but may be small, and $a_0$ was chosen by approximating the initial cell area as around 600 µm$^2$.
\begin{align} k_{0} \sim & \, \text{lognorm}(-4.5, 0.3)\\ (a_0)_0 \sim & \, \text{norm}(600, 75) \end{align}
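To make the hierarchy concrete, here is one draw down all three layers for a single growth event, written in plain numpy (a sketch of the generative story above, not the Stan code; the time grid is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
t = np.arange(0.0, 60.0, 1.0)   # minutes, arbitrary grid

# Layer 0: hyperparameters shared by all cells.
k0 = rng.lognormal(mean=-4.5, sigma=0.3)
a0_pop = rng.normal(loc=600.0, scale=75.0)

# Layer 1: one cell's parameters (non-centered draws).
sigma_a0 = 50.0 * np.abs(rng.normal())
sigma_k0 = 0.002 * np.abs(rng.normal())
a0_cell = a0_pop + sigma_a0 * rng.normal()
k_cell = k0 + sigma_k0 * rng.normal()

# Layer 2: one growth event's parameters, then the Student-t likelihood.
sigma_a1 = 50.0 * np.abs(rng.normal())
sigma_k1 = 0.0005 * np.abs(rng.normal())
a0_event = a0_cell + sigma_a1 * rng.normal()
k_event = k_cell + sigma_k1 * rng.normal()

nu = 1.0 + 10.0 * rng.lognormal(mean=0.1, sigma=1.0)
sigma = 20.0 * np.abs(rng.normal())
areas = a0_event * np.exp(k_event * t) + sigma * rng.standard_t(nu, size=t.size)
```

Each layer only conditions on the one above it, which is exactly the structure the Stan prior predictive code samples.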
The linear model will be the same as the exponential model, with some small changes that will be explored later.
First, I will compile and print out the model:
ppc_exp = bebi103.stan.StanModel(file='./9.1_prior_pred_exp.stan')
print(ppc_exp.model_code)
I will be running sampling code externally in python, so I'm reloading the dataframe to be consistent with external python files I will write.
data = pd.read_csv("./9.1_data_wrangled.csv")
data = data.drop(columns = ["Unnamed: 0"])
data.head()
I need to do one last bit of wrangling to synthesize the arrays Stan needs. This is mostly getting the bacterium number as an array indexed by division.
# let's just see how many divisions we have in total.
# We will need to feed this parameter into stan
N = len(div_num)
num_divisions = len(np.unique(div_num))
# I need to make an array that specifies the bacterium for each division
unique_div_nums = np.unique(div_num)
bac_id_per_div_num = np.zeros(len(unique_div_nums))
for i in range(0, len(unique_div_nums)):
    div = unique_div_nums[i]
    sub_df = data[data["div_num"] == div]
    bac_id_per_div_num[i] = sub_df.iloc[0]["bac_id"]
print("Bacteria ID for each division number:")
print(bac_id_per_div_num)
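The same array can be built with a one-line groupby; this sketch relies on pandas sorting group keys in ascending order, which matches the ordering `np.unique` returns above:

```python
import pandas as pd

def bac_id_per_division(df):
    """First bac_id observed for each div_num, ordered by div_num."""
    return df.groupby("div_num")["bac_id"].first().values
```

Since each division belongs to exactly one bacterium, `first()` recovers the same mapping as the explicit loop.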
Now I will set up the input dictionary, which will be simple given the previous data wrangling, and then perform sampling for the prior predictive check.
df_ppc_exp = None
ppc_exp_samples = None
try:
    df_ppc_exp = pd.read_csv("./9.1_ppc_exp.csv")
    df_ppc_exp = df_ppc_exp.drop(columns=["Unnamed: 0"])
except FileNotFoundError:
    ppc_dict = {"N": N,
                "num_divisions": num_divisions,
                "num_cells": 2,
                "div_num": data["div_num"].values.astype(int),
                "bac_id_per_div_num": bac_id_per_div_num.astype(int),
                "bac_id": data["bac_id"].values.astype(int),
                "time": data["time (min)"].values}
    ppc_exp_samples = ppc_exp.sampling(data=ppc_dict,
                                       algorithm='Fixed_param',
                                       warmup=0,
                                       chains=1,
                                       iter=200)
    df_ppc_exp = bebi103.stan.to_dataframe(ppc_exp_samples, inc_warmup=False, diagnostics=False)
    df_ppc_exp.to_csv("./9.1_ppc_exp.csv")
df_ppc_exp.head()
Sampling worked! Let's first do a sanity check and see whether we are getting valid data.
print("There are %i NaN values out of %i total."
%(len(df_ppc_exp) - len(df_ppc_exp.dropna()), len(df_ppc_exp)))
This is not fantastic, but it is a relatively small amount of error, and we decided to forgive it rather than further constrain the model.
Now I will make some plots. I will start by plotting all possible versions of the first bacterial division, where each line has an entirely new set of hyperparameters.
# I will start by plotting the first series
num_pts = len(data[data["div_num"] == 1]["div_num"].values)
df_ppc_exp_plot = df_ppc_exp.dropna()
df_ppc_exp_plot = df_ppc_exp_plot[["areas[%i]" % i for i in range(1, num_pts + 1)]]
time = data[data["div_num"] == 1]["time (min)"].values
p = bokeh.plotting.Figure(height=500,
                          width=600,
                          title="Series 1 Prior Predictive Check",
                          y_range=[300, 1500],
                          x_axis_label="time (min)",
                          y_axis_label="Cell area (sq µm)")
for i in range(0, len(df_ppc_exp_plot.index)):
    p.line(time, df_ppc_exp_plot.iloc[i].values, line_width=0.2)
bokeh.io.show(p)
This looks great! There is a lot of diversity in this model, both in the characteristic error of the lines and in the shape of the growth. Now suppose we keep the hyperparameters constant and plot all the growth curves for a single sample:
colors = ['red', 'blue']
df_ppc_exp_plot = df_ppc_exp.dropna()
plots = [0]*3
for j in range(0, 3):
    plots[j] = bokeh.plotting.Figure(height=350,
                                     width=300,
                                     title="Sample %i Prior Predictive Check" % (j+1),
                                     y_range=[300, 1500],
                                     x_axis_label="time (min)",
                                     y_axis_label="Cell area (sq µm)")
    divs = np.unique(data["div_num"].values).astype(int)
    for i in divs:
        time = data[data["div_num"] == i + 1]["time (min)"].values
        indices = data[data["div_num"] == i + 1].index.values + 1
        values = df_ppc_exp_plot[["areas[%i]" % i for i in indices]].iloc[j].values
        # Color by cell
        plots[j].line(time, values, line_width=0.2,
                      color=colors[(bac_id_per_div_num[i - 1] - 1).astype(int)])
bokeh.io.show(bokeh.layouts.row(plots))
This is awesome! We can see that our generative model allows substantial differences between cells but still yields physical results. The distribution within a single cell appears much better defined than the overall distribution seen in the series 1 PPC.
mcmc_exp = bebi103.stan.StanModel(file='./9.1_mcmc_exp.stan')
print(mcmc_exp.model_code)
I'm going to be honest here, this model scares me a little.
I will start by running the model on a toy data set to get a rough performance estimate; I will choose two series, one from each cell, and fit them.
# I want to take random divisions. I chose 2 and 41, one from each cell
toy_data = pd.concat([data[data["div_num"]==2], data[data["div_num"]==41]])
# I need the division number to be [1, 2], and this can be changed with quick arithmetic.
toy_data.loc[:,"div_num"] = (toy_data.copy()["div_num"].values % 2) + 1
toy_data.head()
# I actually ran this externally in the file "9.1_mcmc_toy_exp.py"
df_toy_exp = None
toy_samples_exp = None
try:
    df_toy_exp = pd.read_csv("./9.1_toy_mcmc_exp.csv")
    df_toy_exp = df_toy_exp.drop(columns=['Unnamed: 0'])
except FileNotFoundError:
    toy_exp_dict = {"N": len(toy_data.index),
                    "num_divisions": 2,
                    "num_cells": 2,
                    "div_num": toy_data["div_num"].values.astype(int),
                    "bac_id_per_div_num": [1, 2],
                    "bac_id": toy_data["bac_id"].values.astype(int),
                    "time": toy_data["time (min)"].values,
                    "areas": toy_data['area (sq µm)'].values}
    toy_samples_exp = mcmc_exp.sampling(data=toy_exp_dict,
                                        control=dict(adapt_delta=0.98,
                                                     max_treedepth=13),
                                        warmup=2000,
                                        iter=6000,
                                        thin=1)
    df_toy_exp = bebi103.stan.to_dataframe(toy_samples_exp, inc_warmup=False, diagnostics=False)
    df_toy_exp.to_csv("./9.1_toy_mcmc_exp.csv")
# Check mcmc results
orig_stdout = sys.stdout
f = open('./9.1_mcmc_toy_exp_log.txt', 'a')
sys.stdout = f
bebi103.stan.check_all_diagnostics(toy_samples_exp)
sys.stdout = orig_stdout
f.close()
df_toy_exp.head()
!head -10 ./9.1_mcmc_toy_exp_log.txt
Let's check what the posterior looks like next to our data on these series!
plots = [0]*2
s = [2, 41]
indices = [0]*2
indices[0] = range(1, len(data[data["div_num"] == s[0]].index) + 1)
indices[1] = range(len(data[data["div_num"] == s[0]].index), len(toy_data.index))
for j in range(0, 2):
    plots[j] = bokeh.plotting.Figure(height=450,
                                     width=400,
                                     title="Sample %i Posterior Predictive Check" % (s[j]),
                                     y_range=[300, 1500],
                                     x_axis_label="time (min)",
                                     y_axis_label="Cell area (sq µm)")
    time = data[data["div_num"] == s[j]]["time (min)"].values
    for i in range(3000, 3100):
        values = df_toy_exp[["areas_ppc[%i]" % k for k in indices[j]]].iloc[i].values
        plots[j].line(time, values, line_width=0.1, color='blue')
    values = data[data["div_num"] == s[j]]["area (sq µm)"].values
    plots[j].line(time, values, line_width=2, color='red')
bokeh.io.show(bokeh.layouts.row(plots))
This seems to me like a pretty reasonable approximation of our data, though a little error-prone. There is a lot of variability in sample 41, but that makes sense given that sample 41 is noisy anyway. Sample 2 is better behaved, and that is clear from its posterior. Note that the model behaves this way because it assumes these samples come from completely different cells, and thus have very different parameters. In the full model, if cell 2 has better-behaved growth phases, the variance in sample 41 might be toned down.
Let's run the full thing!
!head -1000 ./9.1_mcmc_exp.py
Well that went poorly.
I ran that on my laptop overnight and it only reached sample ten. I then ran it on a c5.4xlarge instance on AWS for ~7 hours, at which point it was 20% done. Then I gave up. In sum, I think the model is just too complicated to be practical for this assignment, and I should have followed Justin's advice to start simple. That is what I will do now. We never fully worked out the linear model for this case because the exponential model failed, so I will now discuss the linear model alongside this exponential one.
I still think our error is not best modeled by a Gaussian, but a Gaussian requires fewer parameters and is thus likely the better choice. Previously we let the error on each parameter during the descent down the hierarchy be cell-dependent; we will remove this consideration for simplicity. Lastly, we will assume that all curves for each cell are drawn from Gaussian distributions with the same variance per cell, centered about an exponential curve. We will still have hundreds of parameters, but perhaps this model will be capable of running. I will now spell out the structure of the exponential model:
Our hierarchical model will have 3 layers:
The model states: $$\text{cell mass} \propto \text{cell area} = a_0 \mathrm{e}^{kt}$$
If we assume that the error of our measurement is Gaussian, we have the following: $$\text{cell area} \sim \text{Normal}((a_0)_{2,c,g}\mathrm{e}^{k_{2, c, g}t},\, \sigma_{2, c})$$ where the subscript $g$ denotes dependence on the growth phase, $c$ denotes dependence on the cell, and the first subscript is the layer in the hierarchy.
The prior on $\sigma_{2, c}$ is simple and has no hierarchy. I'll make it half-normal with scale 20 µm$^2$. $$\sigma_{2, c} \sim 20 * \text{halfnormal}(0, 1)$$
$a_0$ from the bottom layer is found from the $a_0$ above for each cell, with some deviation:
\begin{align} (a_0)_{2, c, g} \sim & \, \text{norm}((a_0)_{1, c}, \sigma_{1, a})\\ \sigma_{1, a} \sim & \, 30 * \text{halfnorm}(0, 1) \end{align}
$k$ is similar.
\begin{align} k_{2, c, g} \sim & \, k_{1, c} + \sigma_{1, k} * \text{norm}(0, 1)\\ \sigma_{1, k} \sim & \, 0.001 * \text{halfnorm}(0, 1) \end{align}
We repeat this process to derive the mid-layer hyperparameters.
\begin{align} (a_0)_{1, c} \sim & \,(a_0)_{0} + \sigma_{a, 0} * \text{norm}(0, 1)\\ \sigma_{a, 0} \sim & \, 30 * \text{halfnorm}(0, 1)\\ k_{1, c} \sim & \, k_{0} + \sigma_{k, 0} * \text{norm}(0, 1)\\ \sigma_{k, 0} \sim & \, 0.001 * \text{halfnorm}(0, 1) \end{align}
Finally, we use simple positive priors for the hyperparameters. We changed $k$ to be normal around the range we were getting before because we suspected it would sample more nicely, and $a_0$ was chosen by approximating the initial cell area as around 600 µm$^2$.
\begin{align} k_{0} \sim & \, \text{Normal}(0.011, 0.001)\\ (a_0)_0 \sim & \, \text{norm}(600, 75) \end{align} The model still looks hefty, but less so than before, and it retains the benefit of individually parametrizing each growth curve. I'll code it up in Stan.
We will follow a very similar structure for the linear model, except that we change the priors and the variances on $b$, the slope. We used the following parameters (I won't spell out exactly how they fit in, because they are used just as in the exponential model). \begin{align} b_{0} \sim & \, \text{Normal}(5, 1)\\ \sigma_{b, 0} \sim & \, 0.2 * \text{halfnorm}(0, 1)\\ b_{1, c} \sim & \, b_{0} + \sigma_{b, 0} * \text{norm}(0, 1)\\ \sigma_{1, b} \sim & \, 0.2 * \text{halfnorm}(0, 1)\\ b_{2, c, g} \sim & \, b_{1, c} + \sigma_{1, b} * \text{norm}(0, 1) \end{align}
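The slope hierarchy follows the same non-centered pattern as the exponential model's; a quick numpy sketch of a single draw (illustrative values only, not the Stan code):

```python
import numpy as np

rng = np.random.RandomState(1)

# Top layer: population-level slope for the linear model.
b0 = rng.normal(loc=5.0, scale=1.0)
sigma_b0 = 0.2 * np.abs(rng.normal())

# Middle layer: a per-cell slope, offset from the population slope.
b_cell = b0 + sigma_b0 * rng.normal()

# Bottom layer: a per-growth-event slope, offset from the cell slope.
sigma_b1 = 0.2 * np.abs(rng.normal())
b_event = b_cell + sigma_b1 * rng.normal()
```

The prior scale on $b$ (around 5 µm$^2$/min) is consistent with $b \approx a_0 k \approx 600 \times 0.01$.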
ppc_exp_2 = bebi103.stan.StanModel(file='./9.1_prior_pred_exp_2.stan')
print(ppc_exp_2.model_code)
Sample and save results:
df_ppc_exp_2 = None
ppc_exp_samples_2 = None
try:
    df_ppc_exp_2 = pd.read_csv("./9.1_ppc_exp_2.csv")
    df_ppc_exp_2 = df_ppc_exp_2.drop(columns=["Unnamed: 0"])
except FileNotFoundError:
    ppc_dict = {"N": N,
                "num_divisions": num_divisions,
                "num_cells": 2,
                "div_num": data["div_num"].values.astype(int),
                "bac_id_per_div_num": bac_id_per_div_num.astype(int),
                "bac_id": data["bac_id"].values.astype(int),
                "time": data["time (min)"].values}
    ppc_exp_samples_2 = ppc_exp_2.sampling(data=ppc_dict,
                                           algorithm='Fixed_param',
                                           warmup=0,
                                           chains=1,
                                           iter=200)
    df_ppc_exp_2 = bebi103.stan.to_dataframe(ppc_exp_samples_2, inc_warmup=False, diagnostics=False)
    df_ppc_exp_2.to_csv("./9.1_ppc_exp_2.csv")
print("There are %i NaN values out of %i total."
%(len(df_ppc_exp_2) - len(df_ppc_exp_2.dropna()), len(df_ppc_exp_2)))
df_ppc_exp_2.head()
Plot the results as we did before and see whether they are sensible. Start by checking the diversity of the model on the first series:
# I will start by plotting the first series
num_pts = len(data[data["div_num"] == 1]["div_num"].values)
df_ppc_exp_plot = df_ppc_exp_2.dropna()
df_ppc_exp_plot = df_ppc_exp_plot[["areas[%i]" % i for i in range(1, num_pts + 1)]]
time = data[data["div_num"] == 1]["time (min)"].values
p = bokeh.plotting.Figure(height=500,
                          width=600,
                          title="Series 1 Prior Predictive Check",
                          y_range=[300, 1500],
                          x_axis_label="time (min)",
                          y_axis_label="Cell area (sq µm)")
for i in range(0, len(df_ppc_exp_plot.index)):
    p.line(time, df_ppc_exp_plot.iloc[i].values, line_width=0.2)
bokeh.io.show(p)
So far it looks like the simplicity cost us very little diversity!
Now let's check how the curves look given constant hyperparameters:
colors = ['red', 'blue']
df_ppc_exp_plot = df_ppc_exp_2.dropna()
plots = [0]*3
for j in range(0, 3):
    plots[j] = bokeh.plotting.Figure(height=350,
                                     width=300,
                                     title="Sample %i Prior Predictive Check" % (j+1),
                                     y_range=[300, 1500],
                                     x_axis_label="time (min)",
                                     y_axis_label="Cell area (sq µm)")
    divs = np.unique(data["div_num"].values).astype(int)
    for i in divs:
        time = data[data["div_num"] == i + 1]["time (min)"].values
        indices = data[data["div_num"] == i + 1].index.values + 1
        values = df_ppc_exp_plot[["areas[%i]" % i for i in indices]].iloc[j].values
        # Color by cell
        plots[j].line(time, values, line_width=0.2,
                      color=colors[(bac_id_per_div_num[i - 1] - 1).astype(int)])
bokeh.io.show(bokeh.layouts.row(plots))
These plots are so cool. I love bokeh. Anyways, the prior-predictive checks look good to me!
I will start by loading and printing the stan model:
ppc_lin = bebi103.stan.StanModel(file='./9.1_prior_pred_lin.stan')
print(ppc_lin.model_code)
df_ppc_lin = None
try:
    df_ppc_lin = pd.read_csv("./9.1_ppc_lin.csv")
    df_ppc_lin = df_ppc_lin.drop(columns=["Unnamed: 0"])
except FileNotFoundError:
    ppc_dict = {"N": N,
                "num_divisions": num_divisions,
                "num_cells": 2,
                "div_num": data["div_num"].values.astype(int),
                "bac_id_per_div_num": bac_id_per_div_num.astype(int),
                "bac_id": data["bac_id"].values.astype(int),
                "time": data["time (min)"].values}
    ppc_lin_samples = ppc_lin.sampling(data=ppc_dict,
                                       algorithm='Fixed_param',
                                       warmup=0,
                                       chains=1,
                                       iter=200)
    df_ppc_lin = bebi103.stan.to_dataframe(ppc_lin_samples, inc_warmup=False, diagnostics=False)
    df_ppc_lin.to_csv("./9.1_ppc_lin.csv")
print("There are %i NaN values out of %i total."
%(len(df_ppc_lin) - len(df_ppc_lin.dropna()), len(df_ppc_lin)))
df_ppc_lin.head()
Plot results on a single curve for all sets of hyperparameters:
# I will start by plotting the first series
num_pts = len(data[data["div_num"] == 1]["div_num"].values)
df_ppc_lin_plot = df_ppc_lin.dropna()
df_ppc_lin_plot = df_ppc_lin_plot[["areas[%i]" % i for i in range(1, num_pts + 1)]]
time = data[data["div_num"] == 1]["time (min)"].values
p = bokeh.plotting.Figure(height=500,
                          width=600,
                          title="Series 1 Prior Predictive Check",
                          y_range=[300, 1500],
                          x_axis_label="time (min)",
                          y_axis_label="Cell area (sq µm)")
for i in range(0, len(df_ppc_lin_plot.index)):
    p.line(time, df_ppc_lin_plot.iloc[i].values, line_width=0.2)
bokeh.io.show(p)
And plot results for the same parameters on multiple curves!
colors = ['red', 'blue']
df_ppc_lin_plot = df_ppc_lin.dropna()
plots = [0]*3
for j in range(0, 3):
    plots[j] = bokeh.plotting.Figure(height=350,
                                     width=300,
                                     title="Sample %i Prior Predictive Check" % (j+1),
                                     y_range=[300, 1500],
                                     x_axis_label="time (min)",
                                     y_axis_label="Cell area (sq µm)")
    divs = np.unique(data["div_num"].values).astype(int)
    for i in divs:
        time = data[data["div_num"] == i + 1]["time (min)"].values
        indices = data[data["div_num"] == i + 1].index.values + 1
        values = df_ppc_lin_plot[["areas[%i]" % i for i in indices]].iloc[j].values
        # Color by cell
        plots[j].line(time, values, line_width=0.2,
                      color=colors[(bac_id_per_div_num[i - 1] - 1).astype(int)])
bokeh.io.show(bokeh.layouts.row(plots))
mcmc_exp_2 = bebi103.stan.StanModel(file='./9.1_mcmc_exp_2.stan')
print(mcmc_exp_2.model_code)
Let's see how this model performs on the toy problem I set up for the first model:
df_toy_exp_2 = None
toy_samples_exp_2 = None
try:
    df_toy_exp_2 = pd.read_csv("./9.1_toy_mcmc_exp_2.csv")
    df_toy_exp_2 = df_toy_exp_2.drop(columns=['Unnamed: 0'])
except FileNotFoundError:
    toy_exp_dict = {"N": len(toy_data.index),
                    "num_divisions": 2,
                    "num_cells": 2,
                    "div_num": toy_data["div_num"].values.astype(int),
                    "bac_id_per_div_num": [1, 2],
                    "bac_id": toy_data["bac_id"].values.astype(int),
                    "time": toy_data["time (min)"].values,
                    "areas": toy_data['area (sq µm)'].values}
    toy_samples_exp_2 = mcmc_exp_2.sampling(data=toy_exp_dict,
                                            control=dict(adapt_delta=0.97,
                                                         max_treedepth=13),
                                            warmup=2000,
                                            iter=6000,
                                            thin=1)
    df_toy_exp_2 = bebi103.stan.to_dataframe(toy_samples_exp_2,
                                             inc_warmup=False,
                                             diagnostics=False)
    df_toy_exp_2.to_csv("./9.1_toy_mcmc_exp_2.csv")
# Check mcmc results, printing to log file in the event that the error
# message is several megabytes (yes this happened).
orig_stdout = sys.stdout
f = open('./9.1_mcmc_toy_exp_2_log.txt', 'a')
sys.stdout = f
print("")
print("-----------------------------Running new model!---------------------------")
bebi103.stan.check_all_diagnostics(toy_samples_exp_2)
sys.stdout = orig_stdout
f.close()
df_toy_exp_2.head()
Check diagnostics:
!tail -10 ./9.1_mcmc_toy_exp_2_log.txt
The stats look great! Let's make a plot of the posterior to see if our model has learned!
plots = [0]*2
s = [2, 41]
indices = [0]*2
indices[0] = range(1, 97)
indices[1] = range(97, 188)
for j in range(0, 2):
    plots[j] = bokeh.plotting.Figure(height=450,
                                     width=400,
                                     title="Sample %i Posterior Predictive Check" % (s[j]),
                                     y_range=[300, 1500],
                                     x_axis_label="time (min)",
                                     y_axis_label="Cell area")
    time = data[data["div_num"] == s[j]]["time (min)"].values
    for i in range(0, 100):
        values = df_toy_exp_2[["areas_ppc[%i]" % k for k in indices[j]]].iloc[i].values
        plots[j].line(time, values, line_width=0.1, color='blue')
    a = [1, 2]
    values = toy_data[toy_data["div_num"] == a[j]]["area (sq µm)"].values
    plots[j].line(time, values, line_width=2, color='red')
bokeh.io.show(bokeh.layouts.row(plots))
This took a lot of work, but finally the checks look good.
This model still takes a lot of time to run, but it should be a more manageable amount of time than the last. I'll set up a python file to run the full thing:
!head -1000 ./9.1_mcmc_exp_2.py
Using 9 cores and 9 chains to parallelize, this model actually ran successfully! I had to run the second part of this program later, since I ran out of hard drive space on the AWS machine. (As a fix, I started a python kernel, loaded the pkl file, deleted the pkl file, and wrote to csv.)
I'll print out the error log and the calculated stats.
!head -100 ./9.1_mcmc_exp_2_log.txt
On the whole I would say the statistics look pretty good. We will come back to the analysis of our posterior. Let's transition back to the linear model.
Let's see if we can learn on the same toy set we've used twice.
mcmc_lin = bebi103.stan.StanModel(file='./9.1_mcmc_lin.stan')
print(mcmc_lin.model_code)
Run MCMC!
df_toy_lin = None
try:
    df_toy_lin = pd.read_csv("./9.1_mcmc_toy_lin.csv")
    df_toy_lin = df_toy_lin.drop(columns=['Unnamed: 0'])
except FileNotFoundError:
    toy_lin_dict = {"N": len(toy_data.index),
                    "num_divisions": 2,
                    "num_cells": 2,
                    "div_num": toy_data["div_num"].values.astype(int),
                    "bac_id_per_div_num": [1, 2],
                    "bac_id": toy_data["bac_id"].values.astype(int),
                    "time": toy_data["time (min)"].values,
                    "areas": toy_data['area (sq µm)'].values}
    toy_samples_lin = mcmc_lin.sampling(data=toy_lin_dict,
                                        control=dict(adapt_delta=0.90,
                                                     max_treedepth=10),
                                        warmup=2000,
                                        iter=6000,
                                        thin=1)
    df_toy_lin = bebi103.stan.to_dataframe(toy_samples_lin,
                                           inc_warmup=False,
                                           diagnostics=False)
    df_toy_lin.to_csv("./9.1_mcmc_toy_lin.csv")

    # Check MCMC results, printing to a log file in case the error
    # message is several megabytes (yes, this happened).
    orig_stdout = sys.stdout
    f = open('./9.1_mcmc_toy_lin_log.txt', 'a')
    sys.stdout = f
    print("")
    print("-----------------------------Running new model!---------------------------")
    bebi103.stan.check_all_diagnostics(toy_samples_lin)
    sys.stdout = orig_stdout
    f.close()

df_toy_lin.head()
Print the log:
!head -10 ./9.1_mcmc_toy_lin_log.txt
Let's do a check to make sure that our model learns on the toy problem.
plots = [0]*2
s = [2, 41]
indices = [0]*2
indices[0] = range(1, 97)
indices[1] = range(97, 188)
for j in range(0, 2):
    plots[j] = bokeh.plotting.Figure(height=450,
                                     width=400,
                                     title="Sample %i Posterior Predictive Check" % (s[j]),
                                     y_range=[300, 1500],
                                     x_axis_label="time (min)",
                                     y_axis_label="Cell area")
    time = data[data["div_num"] == s[j]]["time (min)"].values
    for i in range(3000, 3050):
        values = df_toy_lin[["areas_ppc[%i]" % k for k in indices[j]]].iloc[i].values
        plots[j].line(time, values, line_width=0.1, color='blue')
    a = [1, 2]
    values = toy_data[toy_data["div_num"] == a[j]]["area (sq µm)"].values
    plots[j].line(time, values, line_width=2, color='red')
bokeh.io.show(bokeh.layouts.row(plots))
This looks very reasonable! Already, we can see that the model is forced to compensate for the curvature of the graph on the left by using a higher error, demonstrating that the exponential model can make a tighter fit. That said, our model statistics will tell us whether this is simply overfitting.
I'm going to try running a python file on a 4-core AWS machine. This seems like a reasonable approach since this model is not as heavy as the last. I'll also make sure to have enough hard drive space this time :)
Additionally, thinking ahead, I do not really like data files that are too big to load into RAM, so I'll thin some and collect fewer points than for the exponential model. When I was running the model, I kept getting divergences and tree depth saturation, so I had to crank up "adapt delta" and the tree depth.
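As a sanity check on the sample counts, the number of saved draws follows from Stan's conventions: each chain keeps (iter − warmup) / thin post-warmup draws. A quick sketch (the helper name and the example settings are mine, purely illustrative):

```python
def n_saved_draws(n_iter, warmup, thin, chains):
    """Post-warmup draws kept across all chains, following Stan's
    convention of (iter - warmup) // thin draws per chain."""
    return (n_iter - warmup) // thin * chains

# For example, 9 chains with iter=6000, warmup=2000, thin=2
# keep (6000 - 2000) // 2 * 9 = 18000 draws.
print(n_saved_draws(6000, 2000, 2, 9))  # → 18000
```

Thinning more aggressively, or saving fewer iterations, scales this count down directly.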
# print python file
!head -100 ./9.1_mcmc_lin.py
The model ran in much less time than the exponential. Printing the log:
!head -100 ./9.1_mcmc_lin_log.txt
Stats look good, on the scale of things! Right off the bat, we can see that the WAIC and LOO are higher for this model than for the exponential one, indicating worse expected predictive performance for the linear model.
I will start by looking at the posterior in comparison to the data for some random curves in the model. Loading these data is a bit of a unique challenge, given that I don't actually have enough RAM on my computer to load a 10 GB csv into python and operate on it with pandas. Not to mention that I'm not sending this massive file over github either, so you don't even have it! Luckily, pandas has thought ahead for people like me, and I can read the file one chunk at a time! I'll save a chunk of it so that I can make some plots, and then I'll iterate through the whole thing and collect just the parameters of interest.
try:
    df_exp = pd.read_csv("./9.1_for_exp_analysis.csv")
    df_exp = df_exp.drop(columns=["Unnamed: 0"])
except FileNotFoundError:
    # Synthesize a dataframe with all parameters I care about.
    k = np.zeros(18000)
    a0 = np.zeros(18000)
    sigma_2_1 = np.zeros(18000)
    sigma_2_2 = np.zeros(18000)
    a0_1_1 = np.zeros(18000)
    a0_1_2 = np.zeros(18000)
    k_1_1 = np.zeros(18000)
    k_1_2 = np.zeros(18000)
    sigma_a_0 = np.zeros(18000)
    sigma_k_0 = np.zeros(18000)
    sigma_a_1 = np.zeros(18000)
    sigma_k_1 = np.zeros(18000)

    # Load the data one chunk at a time
    df_exp = pd.read_csv("./9.1_mcmc_exp_2.csv", chunksize=250)

    # Iterate through the file and save parameters of interest
    i = 0
    for chunk in df_exp:
        sys.stdout.write("\r%.2f percent complete." % (100 * i * 250 / 18000))
        for j in range(0, 250):
            m = j + 250 * i
            k[m] = chunk.iloc[j]["k"]
            a0[m] = chunk.iloc[j]["a0"]
            sigma_2_1[m] = chunk.iloc[j]["sigma_2[1]"]
            sigma_2_2[m] = chunk.iloc[j]["sigma_2[2]"]
            a0_1_1[m] = chunk.iloc[j]["a0_1[1]"]
            a0_1_2[m] = chunk.iloc[j]["a0_1[2]"]
            k_1_1[m] = chunk.iloc[j]["k_1[1]"]
            k_1_2[m] = chunk.iloc[j]["k_1[2]"]
            sigma_a_0[m] = chunk.iloc[j]["sigma_a_0"]
            sigma_k_0[m] = chunk.iloc[j]["sigma_k_0"]
            sigma_a_1[m] = chunk.iloc[j]["sigma_a_1"]
            sigma_k_1[m] = chunk.iloc[j]["sigma_k_1"]
        i += 1
    sys.stdout.write("\r%.2f percent complete. " % 100)

    df_exp = pd.DataFrame(np.transpose([k, a0, sigma_2_1, sigma_2_2,
                                        a0_1_1, a0_1_2, k_1_1, k_1_2,
                                        sigma_a_0, sigma_a_1, sigma_k_1]),
                          columns=["k", "a0", "sigma_2[1]", "sigma_2[2]",
                                   "a0_1[1]", "a0_1[2]", "k_1[1]", "k_1[2]",
                                   "sigma_a_0", "sigma_a_1", "sigma_k_1"])
    df_exp.to_csv("./9.1_for_exp_analysis.csv")

df_exp.head()
Saving a chunk to make plots with:
df_chunk = None
try:
    df_chunk = pd.read_csv("./9.1_exp_chunk.csv")
    df_chunk = df_chunk.drop(columns=["Unnamed: 0"])
except FileNotFoundError:
    df = pd.read_csv("./9.1_mcmc_exp_2.csv", chunksize=100)
    # Save just the first chunk for plotting
    for chunk in df:
        chunk.to_csv("./9.1_exp_chunk.csv")
        df_chunk = chunk
        break
Making the plots I promised:
def post_check(samples_df, s):
    """Posterior predictive check plots for the divisions listed in s."""
    plots = [0]*4
    indices = [0]*len(plots)
    for i in range(len(indices)):
        indices[i] = data[data["div_num"] == s[i]].index.values + 1
    for j in range(len(plots)):
        plots[j] = bokeh.plotting.Figure(height=300,
                                         width=250,
                                         title="Sample %i Post. PC" % (s[j]),
                                         y_range=[300, 1500],
                                         x_axis_label="time (min)",
                                         y_axis_label="Cell area")
        time = data[data["div_num"] == s[j]]["time (min)"].values
        for i in range(0, 100):
            values = samples_df[["areas_ppc[%i]" % k for k in indices[j]]].iloc[i].values
            plots[j].line(time, values, line_width=0.1, color='blue')
        values = data[data["div_num"] == s[j]]["area (sq µm)"].values
        plots[j].line(time, values, line_width=2, color='red')
    bokeh.io.show(bokeh.layouts.row(plots))

post_check(df_chunk, [3, 7, 42, 50])
Despite some bounds errors that I must have missed, this looks very reasonable to me. Since the error is now based on how well the model can fit all the curves from a cell, the posterior does not tightly hug each curve, nor does it expand to fit the large error of an individual curve. Let's check how some of our hyperparameters have changed from their priors:
def plot_param_with_prior(param, prior_df, post_df):
    """Plot ECDFs of a parameter from the prior and posterior dataframes
    on the same axes, to compare the two distributions."""
    p = bebi103.viz.ecdf(prior_df[param].values, legend="Prior", color="green", title=param)
    p = bebi103.viz.ecdf(post_df[param].values, p=p, legend="Posterior", color="blue")
    p.legend.location = 'bottom_right'
    return p
params = ["k", "a0", "sigma_2[1]", "sigma_2[2]", "a0_1[1]", "a0_1[2]",
          "k_1[1]", "k_1[2]", "sigma_a_0", "sigma_a_1", "sigma_k_1"]
for param in params:
    bokeh.io.show(plot_param_with_prior(param, df_ppc_exp_2, df_exp))
What an awesome visualization! We clearly learned a great deal from our data, given that the prior ECDFs are mostly much broader than the posteriors. We can see a relatively low variance in $k$, but a high variance in $a_0$, at least in comparison to our priors.
Extract the data of interest from the .csv.
try:
    df_lin = pd.read_csv("./9.1_for_lin_analysis.csv")
    df_lin = df_lin.drop(columns=["Unnamed: 0"])
except FileNotFoundError:
    # Synthesize a dataframe with all parameters I care about.
    b = np.zeros(2668)
    a0 = np.zeros(2668)
    sigma_2_1 = np.zeros(2668)
    sigma_2_2 = np.zeros(2668)
    a0_1_1 = np.zeros(2668)
    a0_1_2 = np.zeros(2668)
    b_1_1 = np.zeros(2668)
    b_1_2 = np.zeros(2668)
    sigma_a_0 = np.zeros(2668)
    sigma_b_0 = np.zeros(2668)
    sigma_a_1 = np.zeros(2668)
    sigma_b_1 = np.zeros(2668)

    # Load the data one chunk at a time
    df_lin = pd.read_csv("./9.1_mcmc_lin.csv", chunksize=667)

    # Iterate through the file and save parameters of interest
    i = 0
    for chunk in df_lin:
        sys.stdout.write("\r%.2f percent complete." % (100 * i * 667 / 2668))
        for j in range(0, 667):
            m = j + 667 * i
            b[m] = chunk.iloc[j]["b"]
            a0[m] = chunk.iloc[j]["a0"]
            sigma_2_1[m] = chunk.iloc[j]["sigma_2[1]"]
            sigma_2_2[m] = chunk.iloc[j]["sigma_2[2]"]
            a0_1_1[m] = chunk.iloc[j]["a0_1[1]"]
            a0_1_2[m] = chunk.iloc[j]["a0_1[2]"]
            b_1_1[m] = chunk.iloc[j]["b_1[1]"]
            b_1_2[m] = chunk.iloc[j]["b_1[2]"]
            sigma_a_0[m] = chunk.iloc[j]["sigma_a_0"]
            sigma_b_0[m] = chunk.iloc[j]["sigma_b_0"]
            sigma_a_1[m] = chunk.iloc[j]["sigma_a_1"]
            sigma_b_1[m] = chunk.iloc[j]["sigma_b_1"]
        i += 1
    sys.stdout.write("\r%.2f percent complete. " % 100)

    df_lin = pd.DataFrame(np.transpose([b, a0, sigma_2_1, sigma_2_2,
                                        a0_1_1, a0_1_2, b_1_1, b_1_2,
                                        sigma_a_0, sigma_a_1, sigma_b_1]),
                          columns=["b", "a0", "sigma_2[1]", "sigma_2[2]",
                                   "a0_1[1]", "a0_1[2]", "b_1[1]", "b_1[2]",
                                   "sigma_a_0", "sigma_a_1", "sigma_b_1"])
    df_lin.to_csv("./9.1_for_lin_analysis.csv")

df_lin.head()
Get a chunk of the whole posterior to make plots.
df_chunk = None
try:
    df_chunk = pd.read_csv("./9.1_lin_chunk.csv")
    df_chunk = df_chunk.drop(columns=["Unnamed: 0"])
except FileNotFoundError:
    df = pd.read_csv("./9.1_mcmc_lin.csv", chunksize=100)
    # Save just the first chunk for plotting
    for chunk in df:
        chunk.to_csv("./9.1_lin_chunk.csv")
        df_chunk = chunk
        break
Make the same plots I made before for the linear model!
post_check(df_chunk, [3,7, 42, 50])
We can clearly see that the error has increased for both cells to accommodate the curved experimental traces.
The last plots I want to make are those that display shrinkage for the linear model. This is a quick loop:
params = ["b", "a0", "sigma_2[1]", "sigma_2[2]", "a0_1[1]", "a0_1[2]",
          "b_1[1]", "b_1[2]", "sigma_a_0", "sigma_a_1", "sigma_b_1"]
for param in params:
    bokeh.io.show(plot_param_with_prior(param, df_ppc_lin, df_lin))
The shrinkage was fantastic for this model; perhaps even superior to the shrinkage for the exponential model. These ECDFs commonly look like delta functions, with the exception of the error in the top-level intercept hyperparameter $a_{00}$, which appears to have a mode at zero and a slow decay.
To compare the linear and exponential models, I will calculate the Akaike weights from the LOO values that I recorded in the log files during MCMC. For two models, the Akaike weights are given by the following formula (from the tutorial):
\begin{align} w_i = \frac{\exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}{1 + \exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}. \end{align}
exp_loo = np.float128(96284.754451)
lin_loo = np.float128(105282.039389)
d_loo = (lin_loo - exp_loo)
w_exp = np.exp(d_loo/2) / (1 + np.exp(d_loo/2))
w_lin = 1 - w_exp
print(' Linear model weight:', w_lin)
print('Exponential model weight:', w_exp)
We are very clearly getting precision errors in this calculation, but I think that's beside the point. The true model is clearly exponential, not linear. I am actually amazed that the result is so spectacular!
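For what it's worth, the overflow can be sidestepped entirely: the weight above is just a logistic function of half the LOO difference, which can be computed stably in pure Python with no need for float128 (the helper name is mine, just a sketch):

```python
import math

def akaike_weight(d_loo):
    """Akaike weight of the better model, given
    d_loo = LOO_worse - LOO_better, via a numerically
    stable logistic function."""
    x = d_loo / 2
    if x >= 0:
        return 1 / (1 + math.exp(-x))   # exp(-x) cannot overflow here
    return math.exp(x) / (1 + math.exp(x))

# With d_loo ≈ 8997, the exponential model's weight is
# indistinguishable from 1 in double precision.
print(akaike_weight(105282.039389 - 96284.754451))  # → 1.0
```

Branching on the sign ensures the exponential is always evaluated at a non-positive argument, so it underflows to zero rather than overflowing.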
We can conclude with confidence that cell growth is exponential, even at the scale of a single cell!
%load_ext watermark
%watermark -v -p numpy,pandas,pystan,bokeh,altair,bebi103,jupyterlab